1 Introduction

This is my first project, which is the Google data analytics certificate capstone project named “Cyclistic bike share”. For the analysis purpose i will use R studio. I will work for a fictional company, Cyclistic, and meet different characters and team members. In order to answer the key business questions, I will follow the steps of the data analysis process as shown below:

  • Ask

  • Prepare

  • Process

  • Analyze

  • Share

  • Act

1.1 Scenario

You are a junior data analyst working in the marketing analyst team at Cyclistic, a bike-share company in Chicago. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, your team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, your team will design a new marketing strategy to convert casual riders into annual members. But first, Cyclistic executives must approve your recommendations, so they must be backed up with compelling data insights and professional data visualizations.

1.2 About the company

In 2016, Cyclistic launched a successful bike-share offering. Since then, the program has grown to a fleet of 5,824 bicycles that are geotracked and locked into a network of 692 stations across Chicago. The bikes can be unlocked from one station and returned to any other station in the system anytime. Until now, Cyclistic’s marketing strategy relied on building general awareness and appealing to broad consumer segments. One approach that helped make these things possible was the flexibility of its pricing plans: single-ride passes, full-day passes, and annual memberships. Customers who purchase single-ride or full-day passes are referred to as casual riders. Customers who purchase annual memberships are Cyclistic members. Cyclistic’s finance analysts have concluded that annual members are much more profitable than casual riders. Although the pricing flexibility helps Cyclistic attract more customers, Lily Moreno (marketing director) believes that maximizing the number of annual members will be key to future growth. Rather than creating a marketing campaign that targets all-new customers, Moreno believes there is a very good chance to convert casual riders into members. She notes that casual riders are already aware of the Cyclistic program and have chosen Cyclistic for their mobility needs. Moreno has set a clear goal: Design marketing strategies aimed at converting casual riders into annual members. In order to do that, however, the marketing analyst team needs to better understand how annual members and casual riders differ, why casual riders would buy a membership, and how digital media could affect their marketing tactics. Moreno and her team are interested in analyzing the Cyclistic historical bike trip data to identify trends.

2 Ask

A clear statement of the business task: Cyclistic finance team concluded that annual members are much more profitable than casual riders. Although the pricing flexibility helps Cyclistic attract more customers, They believe that maximising the number of annual members will be key to future growth. Rather than creating a marketing campaign that targets all-new customers, they believe there is a very good chance to convert casual riders into members. They note that casual riders are already aware of the Cyclistic program and have chosen Cyclistic for their mobility needs.The company has set a clear goal: Design marketing strategies aimed at converting casual riders into annual members. In order to do that, however, the marketing analyst team needs to better understand how annual members and casual riders differ, why casual riders would buy a membership, and how digital media could affect their marketing tactics. Moreno has assigned you the first question to answer:

How do annual members and casual riders use Cyclistic bikes differently?

2.1 Key stakeholders

  • Cyclistic marketing analytics team: A team of data analysts who are responsible for collecting, analyzing, and reporting data that helps guide Cyclistic marketing strategy. I am hypothetically part of the analytics team as a junior data analyst.

  • Executive team: The notoriously detail-oriented executive team will decide whether to approve the recommended marketing program.

  • Lily moreno: The director of marketing and your manager. Moreno is responsible for the development of campaigns and initiatives to promote the bike-share program. These may include email, social media, and other channels.

3 Prepare

At this phase asking relevant questions are key to preparing the data before moving to process and analyze phase:

Addressing data location, organisation, credibility, privacy and integrity

Q1: Where is your data located?

Ans: We will use Cyclistic’s historical trip data to analyse and identify trends. The datasets have a different name because Cyclistic is a fictional company. For the purposes of this case study, the datasets are appropriate and will enable us to answer the business questions. (The previous 12 months data is made available by Cyclistic bike share company and the link is as follows: https://divvy-tripdata.s3.amazonaws.com/index.html )

Q2: How is the data organised?

Ans: The data is organised as monthly rider data in csv file format. The data contains following columns: data contain ride_id, rideable_type, start_station_name, start_station_id, end_station_name, end_station_id, member_casual,start_lat, start_lng, end_lat, end_lng,started_at(day and time), ended_at(day and time).

Q3: Are there issues with bias or credibility in this data? Does your data ROCCC?

Ans: As we know that ROCCC stands for reliable, original, comprehensive, current and cited. As data was collected by cyclistic bike sharing company, the data collected is reliable, original, formatted in csv file format for each month, current and cited.

Q4: How are you addressing licensing, privacy, security, and accessibility?

Ans: This is public data that we can use to explore how different customer types are using Cyclistic bikes. But note that data-privacy issues prohibit us from using riders’ personally identifiable information. Therefore we won’t have the information related to customers credit card numbers, names and their addresses. (The data has been made available by Motivate International Inc. under this license.)

Q5: How did you verify the data’s integrity?

Ans: I have checked the data for errors and cleaning/manipulating the data to verify its integrity since it is crucial to verify the data’s integrity otherwise our whole analysis would go wrong.

Let’s start preparing our data:

3.1 Install packages and load

install.packages("tidyverse", repos='http://cran.us.r-project.org')
## 
## The downloaded binary packages are in
##  /var/folders/b8/3t5ydjnn2rx_j2hckhjh2m800000gn/T//Rtmp0Hd6g3/downloaded_packages
install.packages("janitor", repos='http://cran.us.r-project.org')
## 
## The downloaded binary packages are in
##  /var/folders/b8/3t5ydjnn2rx_j2hckhjh2m800000gn/T//Rtmp0Hd6g3/downloaded_packages
install.packages("dplyr", repos='http://cran.us.r-project.org')
## 
## The downloaded binary packages are in
##  /var/folders/b8/3t5ydjnn2rx_j2hckhjh2m800000gn/T//Rtmp0Hd6g3/downloaded_packages
install.packages("ggmap", repos='http://cran.us.r-project.org')
## 
## The downloaded binary packages are in
##  /var/folders/b8/3t5ydjnn2rx_j2hckhjh2m800000gn/T//Rtmp0Hd6g3/downloaded_packages
install.packages("skimr", repos='http://cran.us.r-project.org')
## 
## The downloaded binary packages are in
##  /var/folders/b8/3t5ydjnn2rx_j2hckhjh2m800000gn/T//Rtmp0Hd6g3/downloaded_packages
install.packages("lubridate", repos='http://cran.us.r-project.org')
## 
## The downloaded binary packages are in
##  /var/folders/b8/3t5ydjnn2rx_j2hckhjh2m800000gn/T//Rtmp0Hd6g3/downloaded_packages
install.packages("ggplot2", repos='http://cran.us.r-project.org')
## 
## The downloaded binary packages are in
##  /var/folders/b8/3t5ydjnn2rx_j2hckhjh2m800000gn/T//Rtmp0Hd6g3/downloaded_packages
install.packages("readr", repos='http://cran.us.r-project.org')
## 
## The downloaded binary packages are in
##  /var/folders/b8/3t5ydjnn2rx_j2hckhjh2m800000gn/T//Rtmp0Hd6g3/downloaded_packages
install.packages("googlesheets4", repos='http://cran.us.r-project.org', dependencies = TRUE, INSTALL_opts = '--no-lock')
## 
##   There is a binary version available but the source version is later:
##               binary source needs_compilation
## googlesheets4  1.0.1  1.1.0             FALSE
## installing the source package 'googlesheets4'
install.packages("knitr", repos='http://cran.us.r-project.org')
## 
## The downloaded binary packages are in
##  /var/folders/b8/3t5ydjnn2rx_j2hckhjh2m800000gn/T//Rtmp0Hd6g3/downloaded_packages
install.packages("imager", repos='http://cran.us.r-project.org')
## 
## The downloaded binary packages are in
##  /var/folders/b8/3t5ydjnn2rx_j2hckhjh2m800000gn/T//Rtmp0Hd6g3/downloaded_packages
install.packages("rmarkdown", repos='http://cran.us.r-project.org')
## 
## The downloaded binary packages are in
##  /var/folders/b8/3t5ydjnn2rx_j2hckhjh2m800000gn/T//Rtmp0Hd6g3/downloaded_packages
## loading the packages
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.1     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the ]8;;http://conflicted.r-lib.org/conflicted package]8;; to force all conflicts to become errors
library(janitor)
## 
## Attaching package: 'janitor'
## 
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
library(dplyr)
library(ggmap)
## ℹ Google's Terms of Service: ]8;;https://mapsplatform.google.com<https://mapsplatform.google.com>]8;;
## ℹ Please cite ggmap if you use it! Use `citation("ggmap")` for details.
library(skimr)
library(lubridate)
library(ggplot2)
library(readr)
library(googlesheets4)
library(knitr)
library(imager)
## Loading required package: magrittr
## 
## Attaching package: 'magrittr'
## 
## The following object is masked from 'package:ggmap':
## 
##     inset
## 
## The following object is masked from 'package:purrr':
## 
##     set_names
## 
## The following object is masked from 'package:tidyr':
## 
##     extract
## 
## 
## Attaching package: 'imager'
## 
## The following object is masked from 'package:magrittr':
## 
##     add
## 
## The following object is masked from 'package:stringr':
## 
##     boundary
## 
## The following object is masked from 'package:dplyr':
## 
##     where
## 
## The following object is masked from 'package:tidyr':
## 
##     fill
## 
## The following objects are masked from 'package:stats':
## 
##     convolve, spectrum
## 
## The following object is masked from 'package:graphics':
## 
##     frame
## 
## The following object is masked from 'package:base':
## 
##     save.image
library(rmarkdown)
getwd()
## [1] "/Users/gebruiker/Documents/R/Cyclistic bike share project"

3.2 Importing data in Rstudio

Jan22 <- read_csv("202201-divvy-tripdata.csv")
## Rows: 103770 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Feb22 <- read_csv("202202-divvy-tripdata.csv")
## Rows: 115609 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Mar22 <- read_csv("202203-divvy-tripdata.csv")
## Rows: 284042 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Apr22 <- read_csv("202204-divvy-tripdata.csv")
## Rows: 371249 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
May22 <- read_csv("202205-divvy-tripdata.csv")
## Rows: 634858 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Jun22 <- read_csv("202206-divvy-tripdata.csv")
## Rows: 769204 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Jul22 <- read_csv("202207-divvy-tripdata.csv")
## Rows: 823488 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Aug22 <- read_csv("202208-divvy-tripdata.csv")
## Rows: 785932 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Sep22 <- read_csv("202209-divvy-publictripdata.csv")
## Rows: 701339 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Oct22 <- read_csv("202210-divvy-tripdata.csv")
## Rows: 558685 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Nov22 <- read_csv("202211-divvy-tripdata.csv")
## Rows: 337735 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Dec22 <- read_csv("202212-divvy-tripdata.csv")
## Rows: 181806 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

3.3 Checking columns consistency

colnames(Jan22)
##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"
colnames(Feb22)
##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"
colnames(Mar22)
##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"
colnames(Apr22)
##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"
colnames(May22)
##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"
colnames(Jun22)
##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"
colnames(Jul22)
##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"
colnames(Aug22)
##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"
colnames(Sep22)
##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"
colnames(Oct22)
##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"
colnames(Sep22)
##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"
colnames(Oct22)
##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"
colnames(Nov22)
##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"
colnames(Dec22)
##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"
## Checking data structure
str(Jan22)
## spc_tbl_ [103,770 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ ride_id           : chr [1:103770] "C2F7DD78E82EC875" "A6CF8980A652D272" "BD0F91DFF741C66D" "CBB80ED419105406" ...
##  $ rideable_type     : chr [1:103770] "electric_bike" "electric_bike" "classic_bike" "classic_bike" ...
##  $ started_at        : POSIXct[1:103770], format: "2022-01-13 11:59:47" "2022-01-10 08:41:56" ...
##  $ ended_at          : POSIXct[1:103770], format: "2022-01-13 12:02:44" "2022-01-10 08:46:17" ...
##  $ start_station_name: chr [1:103770] "Glenwood Ave & Touhy Ave" "Glenwood Ave & Touhy Ave" "Sheffield Ave & Fullerton Ave" "Clark St & Bryn Mawr Ave" ...
##  $ start_station_id  : chr [1:103770] "525" "525" "TA1306000016" "KA1504000151" ...
##  $ end_station_name  : chr [1:103770] "Clark St & Touhy Ave" "Clark St & Touhy Ave" "Greenview Ave & Fullerton Ave" "Paulina St & Montrose Ave" ...
##  $ end_station_id    : chr [1:103770] "RP-007" "RP-007" "TA1307000001" "TA1309000021" ...
##  $ start_lat         : num [1:103770] 42 42 41.9 42 41.9 ...
##  $ start_lng         : num [1:103770] -87.7 -87.7 -87.7 -87.7 -87.6 ...
##  $ end_lat           : num [1:103770] 42 42 41.9 42 41.9 ...
##  $ end_lng           : num [1:103770] -87.7 -87.7 -87.7 -87.7 -87.6 ...
##  $ member_casual     : chr [1:103770] "casual" "casual" "member" "casual" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   ride_id = col_character(),
##   ..   rideable_type = col_character(),
##   ..   started_at = col_datetime(format = ""),
##   ..   ended_at = col_datetime(format = ""),
##   ..   start_station_name = col_character(),
##   ..   start_station_id = col_character(),
##   ..   end_station_name = col_character(),
##   ..   end_station_id = col_character(),
##   ..   start_lat = col_double(),
##   ..   start_lng = col_double(),
##   ..   end_lat = col_double(),
##   ..   end_lng = col_double(),
##   ..   member_casual = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>
str(Feb22)
str(Mar22)
str(Apr22)
str(May22)
str(Jun22)  
str(Jul22)
str(Aug22)
str(Sep22)
str(Oct22)
str(Nov22)
str(Dec22)

3.4 Combine datasets

## Using rbind function to combine all datasets into large one dataframe
bike_rides <- rbind(Jan22, Feb22, Mar22, Apr22, May22, Jun22, Jul22, Aug22, Sep22, Oct22, Nov22, Dec22)
## Removing individual data sets to free up memory
rm(Jan22, Feb22, Mar22, Apr22, May22, Jun22, Jul22, Aug22, Sep22, Oct22, Nov22, Dec22)

4 Process

Key tasks

  1. Check the data for errors.
  2. Choose your tools.
  3. Transform the data so you can work with it effectively.
  4. Document the cleaning process.

Deliverable

Documentation of any cleaning or manipulation of data.

Let’s dive into processing our data

4.1 Inspecting the data frame

head(bike_rides)
## # A tibble: 6 × 13
##   ride_id        ridea…¹ started_at          ended_at            start…² start…³
##   <chr>          <chr>   <dttm>              <dttm>              <chr>   <chr>  
## 1 C2F7DD78E82EC… electr… 2022-01-13 11:59:47 2022-01-13 12:02:44 Glenwo… 525    
## 2 A6CF8980A652D… electr… 2022-01-10 08:41:56 2022-01-10 08:46:17 Glenwo… 525    
## 3 BD0F91DFF741C… classi… 2022-01-25 04:53:40 2022-01-25 04:58:01 Sheffi… TA1306…
## 4 CBB80ED419105… classi… 2022-01-04 00:18:04 2022-01-04 00:33:00 Clark … KA1504…
## 5 DDC963BFDDA51… classi… 2022-01-20 01:31:10 2022-01-20 01:37:12 Michig… TA1309…
## 6 A39C6F6CC0586… classi… 2022-01-11 18:48:09 2022-01-11 18:51:31 Wood S… 637    
## # … with 7 more variables: end_station_name <chr>, end_station_id <chr>,
## #   start_lat <dbl>, start_lng <dbl>, end_lat <dbl>, end_lng <dbl>,
## #   member_casual <chr>, and abbreviated variable names ¹​rideable_type,
## #   ²​start_station_name, ³​start_station_id
class(bike_rides)
## [1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame"
dim(bike_rides)
## [1] 5667717      13
colSums(is.na(bike_rides))
##            ride_id      rideable_type         started_at           ended_at 
##                  0                  0                  0                  0 
## start_station_name   start_station_id   end_station_name     end_station_id 
##             833064             833064             892742             892742 
##          start_lat          start_lng            end_lat            end_lng 
##                  0                  0               5858               5858 
##      member_casual 
##                  0
summary(is.na(bike_rides))
##   ride_id        rideable_type   started_at       ended_at      
##  Mode :logical   Mode :logical   Mode :logical   Mode :logical  
##  FALSE:5667717   FALSE:5667717   FALSE:5667717   FALSE:5667717  
##                                                                 
##  start_station_name start_station_id end_station_name end_station_id 
##  Mode :logical      Mode :logical    Mode :logical    Mode :logical  
##  FALSE:4834653      FALSE:4834653    FALSE:4774975    FALSE:4774975  
##  TRUE :833064       TRUE :833064     TRUE :892742     TRUE :892742   
##  start_lat       start_lng        end_lat         end_lng       
##  Mode :logical   Mode :logical   Mode :logical   Mode :logical  
##  FALSE:5667717   FALSE:5667717   FALSE:5661859   FALSE:5661859  
##                                  TRUE :5858      TRUE :5858     
##  member_casual  
##  Mode :logical  
##  FALSE:5667717  
## 
colnames(bike_rides)
##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"
summary(bike_rides)
##    ride_id          rideable_type        started_at                    
##  Length:5667717     Length:5667717     Min.   :2022-01-01 00:00:05.00  
##  Class :character   Class :character   1st Qu.:2022-05-28 19:21:05.00  
##  Mode  :character   Mode  :character   Median :2022-07-22 15:03:59.00  
##                                        Mean   :2022-07-20 07:21:18.74  
##                                        3rd Qu.:2022-09-16 07:21:29.00  
##                                        Max.   :2022-12-31 23:59:26.00  
##                                                                        
##     ended_at                      start_station_name start_station_id  
##  Min.   :2022-01-01 00:01:48.00   Length:5667717     Length:5667717    
##  1st Qu.:2022-05-28 19:43:07.00   Class :character   Class :character  
##  Median :2022-07-22 15:24:44.00   Mode  :character   Mode  :character  
##  Mean   :2022-07-20 07:40:45.33                                        
##  3rd Qu.:2022-09-16 07:39:03.00                                        
##  Max.   :2023-01-02 04:56:45.00                                        
##                                                                        
##  end_station_name   end_station_id       start_lat       start_lng     
##  Length:5667717     Length:5667717     Min.   :41.64   Min.   :-87.84  
##  Class :character   Class :character   1st Qu.:41.88   1st Qu.:-87.66  
##  Mode  :character   Mode  :character   Median :41.90   Median :-87.64  
##                                        Mean   :41.90   Mean   :-87.65  
##                                        3rd Qu.:41.93   3rd Qu.:-87.63  
##                                        Max.   :45.64   Max.   :-73.80  
##                                                                        
##     end_lat         end_lng       member_casual     
##  Min.   : 0.00   Min.   :-88.14   Length:5667717    
##  1st Qu.:41.88   1st Qu.:-87.66   Class :character  
##  Median :41.90   Median :-87.64   Mode  :character  
##  Mean   :41.90   Mean   :-87.65                     
##  3rd Qu.:41.93   3rd Qu.:-87.63                     
##  Max.   :42.37   Max.   :  0.00                     
##  NA's   :5858    NA's   :5858

4.2 Processing the data frame

# Removing the duplicates
bike_rides %>% distinct()
## # A tibble: 5,667,717 × 13
##    ride_id       ridea…¹ started_at          ended_at            start…² start…³
##    <chr>         <chr>   <dttm>              <dttm>              <chr>   <chr>  
##  1 C2F7DD78E82E… electr… 2022-01-13 11:59:47 2022-01-13 12:02:44 Glenwo… 525    
##  2 A6CF8980A652… electr… 2022-01-10 08:41:56 2022-01-10 08:46:17 Glenwo… 525    
##  3 BD0F91DFF741… classi… 2022-01-25 04:53:40 2022-01-25 04:58:01 Sheffi… TA1306…
##  4 CBB80ED41910… classi… 2022-01-04 00:18:04 2022-01-04 00:33:00 Clark … KA1504…
##  5 DDC963BFDDA5… classi… 2022-01-20 01:31:10 2022-01-20 01:37:12 Michig… TA1309…
##  6 A39C6F6CC058… classi… 2022-01-11 18:48:09 2022-01-11 18:51:31 Wood S… 637    
##  7 BDC4AB637EDF… classi… 2022-01-30 18:32:52 2022-01-30 18:49:26 Oakley… KA1504…
##  8 81751A3186E5… classi… 2022-01-22 12:20:02 2022-01-22 12:32:06 Sheffi… TA1306…
##  9 154222B86A33… electr… 2022-01-17 07:34:41 2022-01-17 08:00:08 Racine… 13304  
## 10 72DC25B2DD46… classi… 2022-01-28 15:27:53 2022-01-28 15:35:16 LaSall… TA1309…
## # … with 5,667,707 more rows, 7 more variables: end_station_name <chr>,
## #   end_station_id <chr>, start_lat <dbl>, start_lng <dbl>, end_lat <dbl>,
## #   end_lng <dbl>, member_casual <chr>, and abbreviated variable names
## #   ¹​rideable_type, ²​start_station_name, ³​start_station_id
dim(bike_rides)
## [1] 5667717      13
str(bike_rides)
## spc_tbl_ [5,667,717 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ ride_id           : chr [1:5667717] "C2F7DD78E82EC875" "A6CF8980A652D272" "BD0F91DFF741C66D" "CBB80ED419105406" ...
##  $ rideable_type     : chr [1:5667717] "electric_bike" "electric_bike" "classic_bike" "classic_bike" ...
##  $ started_at        : POSIXct[1:5667717], format: "2022-01-13 11:59:47" "2022-01-10 08:41:56" ...
##  $ ended_at          : POSIXct[1:5667717], format: "2022-01-13 12:02:44" "2022-01-10 08:46:17" ...
##  $ start_station_name: chr [1:5667717] "Glenwood Ave & Touhy Ave" "Glenwood Ave & Touhy Ave" "Sheffield Ave & Fullerton Ave" "Clark St & Bryn Mawr Ave" ...
##  $ start_station_id  : chr [1:5667717] "525" "525" "TA1306000016" "KA1504000151" ...
##  $ end_station_name  : chr [1:5667717] "Clark St & Touhy Ave" "Clark St & Touhy Ave" "Greenview Ave & Fullerton Ave" "Paulina St & Montrose Ave" ...
##  $ end_station_id    : chr [1:5667717] "RP-007" "RP-007" "TA1307000001" "TA1309000021" ...
##  $ start_lat         : num [1:5667717] 42 42 41.9 42 41.9 ...
##  $ start_lng         : num [1:5667717] -87.7 -87.7 -87.7 -87.7 -87.6 ...
##  $ end_lat           : num [1:5667717] 42 42 41.9 42 41.9 ...
##  $ end_lng           : num [1:5667717] -87.7 -87.7 -87.7 -87.7 -87.6 ...
##  $ member_casual     : chr [1:5667717] "casual" "casual" "member" "casual" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   ride_id = col_character(),
##   ..   rideable_type = col_character(),
##   ..   started_at = col_datetime(format = ""),
##   ..   ended_at = col_datetime(format = ""),
##   ..   start_station_name = col_character(),
##   ..   start_station_id = col_character(),
##   ..   end_station_name = col_character(),
##   ..   end_station_id = col_character(),
##   ..   start_lat = col_double(),
##   ..   start_lng = col_double(),
##   ..   end_lat = col_double(),
##   ..   end_lng = col_double(),
##   ..   member_casual = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

4.2.1 Create date variables

bike_rides <- bike_rides %>% mutate(start_date = as.Date(bike_rides$started_at))
bike_rides <- bike_rides %>% mutate(end_date = as.Date(bike_rides$ended_at))
bike_rides <- bike_rides %>%
  mutate(
    Start_Yr = year(started_at),
    Start_Mth = month(started_at),
    Start_Day = wday(started_at),
    Start_Hr = hour(started_at)
  )
bike_rides <- bike_rides %>%
  mutate(
    End_Yr = year(ended_at),
    End_Mth = month(ended_at),
    End_Day = wday(ended_at),
    End_Hr = hour(ended_at))

4.2.2 Adding the Seasons column

bike_rides <- bike_rides %>% mutate(season = recode(Start_Mth,
                                                    `12` = "Winter",
                                                    `1` = "Winter",
                                                    `2` = "Winter",
                                                    `3` = "Spring",
                                                    `4` = "Spring",
                                                    `5` = "Spring",
                                                    `6` = "Summer",
                                                    `7` = "Summer",
                                                    `8` = "Summer",
                                                    `9` = "Fall",
                                                    `10` = "Fall",
                                                    `11` = "Fall"))
dim(bike_rides)
## [1] 5667717      24

4.2.3 Creating columns ride_length_sec, ride_length_min and ride_length_hr

bike_rides %>% filter(ended_at < started_at) %>% count()
## # A tibble: 1 × 1
##       n
##   <int>
## 1   100
bike_rides %>% filter(ended_at > started_at) %>% count()
## # A tibble: 1 × 1
##         n
##     <int>
## 1 5667186
# creating column ride_length_sec (in seconds)
bike_rides <- bike_rides %>%
  mutate(ride_length_sec = ended_at - started_at)
# creating column ride_length_min (in minutes)
bike_rides <- bike_rides %>%
  mutate(ride_length_min = difftime(ended_at,started_at,units='mins',2))
# creating column ride_length_hour (in hours)
bike_rides$ride_length_hour <- round(as.numeric(difftime(bike_rides$ended_at, bike_rides$started_at, units = "hours")), 2)
str(bike_rides)
## tibble [5,667,717 × 27] (S3: tbl_df/tbl/data.frame)
##  $ ride_id           : chr [1:5667717] "C2F7DD78E82EC875" "A6CF8980A652D272" "BD0F91DFF741C66D" "CBB80ED419105406" ...
##  $ rideable_type     : chr [1:5667717] "electric_bike" "electric_bike" "classic_bike" "classic_bike" ...
##  $ started_at        : POSIXct[1:5667717], format: "2022-01-13 11:59:47" "2022-01-10 08:41:56" ...
##  $ ended_at          : POSIXct[1:5667717], format: "2022-01-13 12:02:44" "2022-01-10 08:46:17" ...
##  $ start_station_name: chr [1:5667717] "Glenwood Ave & Touhy Ave" "Glenwood Ave & Touhy Ave" "Sheffield Ave & Fullerton Ave" "Clark St & Bryn Mawr Ave" ...
##  $ start_station_id  : chr [1:5667717] "525" "525" "TA1306000016" "KA1504000151" ...
##  $ end_station_name  : chr [1:5667717] "Clark St & Touhy Ave" "Clark St & Touhy Ave" "Greenview Ave & Fullerton Ave" "Paulina St & Montrose Ave" ...
##  $ end_station_id    : chr [1:5667717] "RP-007" "RP-007" "TA1307000001" "TA1309000021" ...
##  $ start_lat         : num [1:5667717] 42 42 41.9 42 41.9 ...
##  $ start_lng         : num [1:5667717] -87.7 -87.7 -87.7 -87.7 -87.6 ...
##  $ end_lat           : num [1:5667717] 42 42 41.9 42 41.9 ...
##  $ end_lng           : num [1:5667717] -87.7 -87.7 -87.7 -87.7 -87.6 ...
##  $ member_casual     : chr [1:5667717] "casual" "casual" "member" "casual" ...
##  $ start_date        : Date[1:5667717], format: "2022-01-13" "2022-01-10" ...
##  $ end_date          : Date[1:5667717], format: "2022-01-13" "2022-01-10" ...
##  $ Start_Yr          : num [1:5667717] 2022 2022 2022 2022 2022 ...
##  $ Start_Mth         : num [1:5667717] 1 1 1 1 1 1 1 1 1 1 ...
##  $ Start_Day         : num [1:5667717] 5 2 3 3 5 3 1 7 2 6 ...
##  $ Start_Hr          : int [1:5667717] 11 8 4 0 1 18 18 12 7 15 ...
##  $ End_Yr            : num [1:5667717] 2022 2022 2022 2022 2022 ...
##  $ End_Mth           : num [1:5667717] 1 1 1 1 1 1 1 1 1 1 ...
##  $ End_Day           : num [1:5667717] 5 2 3 3 5 3 1 7 2 6 ...
##  $ End_Hr            : int [1:5667717] 12 8 4 0 1 18 18 12 8 15 ...
##  $ season            : chr [1:5667717] "Winter" "Winter" "Winter" "Winter" ...
##  $ ride_length_sec   : 'difftime' num [1:5667717] 177 261 261 896 ...
##   ..- attr(*, "units")= chr "secs"
##  $ ride_length_min   : 'difftime' num [1:5667717] 2.95 4.35 4.35 14.9333333333333 ...
##   ..- attr(*, "units")= chr "mins"
##  $ ride_length_hour  : num [1:5667717] 0.05 0.07 0.07 0.25 0.1 0.06 0.28 0.2 0.42 0.12 ...

4.2.4 Exploring ride_length_min (in minutes) column in depth

sum(bike_rides$ride_length_min > 1440)
## [1] 5360

Less than 1 minute

sum(bike_rides$ride_length_min < 1)
## [1] 121089

To confirm any negative figure, use less than 0

sum(bike_rides$ride_length_min < 0)
## [1] 100

Greater than 6 hours(360 mins)

sum(bike_rides$ride_length_min > 360)
## [1] 10025

Removing ride_length_min > 24 hours and < 0 mins

bike_rides <- bike_rides %>% 
  filter(ride_length_min >= 0 & ride_length_min <= 1440)
dim(bike_rides)
## [1] 5662257      27

4.2.5 Renaming the columns

bike_rides <- rename(bike_rides, "bike_type" = "rideable_type", "user_type" = "member_casual")

4.3 Addressing missing values

# checking if Na values in start and end_station_id pertains to any user and bike type
bike_rides %>% filter(is.na(start_station_id)) %>%
  count(start_station_id, start_station_name, bike_type, user_type)
## # A tibble: 2 × 5
##   start_station_id start_station_name bike_type     user_type      n
##   <chr>            <chr>              <chr>         <chr>      <int>
## 1 <NA>             <NA>               electric_bike casual    347395
## 2 <NA>             <NA>               electric_bike member    485646
# we can see here that electric bikes with mixed user types are linked to na values in start station id
# we can see here that electric bikes are associated with na values in start station id codes
# checking for end station id codes
bike_rides %>% filter(is.na(end_station_id)) %>%
  count(end_station_id, end_station_name, bike_type, user_type)
## # A tibble: 5 × 5
##   end_station_id end_station_name bike_type     user_type      n
##   <chr>          <chr>            <chr>         <chr>      <int>
## 1 <NA>           <NA>             classic_bike  casual       103
## 2 <NA>           <NA>             classic_bike  member       415
## 3 <NA>           <NA>             docked_bike   casual       682
## 4 <NA>           <NA>             electric_bike casual    403587
## 5 <NA>           <NA>             electric_bike member    482735
# here three different bike types with both users pertains to na values
str(bike_rides)
## tibble [5,662,257 × 27] (S3: tbl_df/tbl/data.frame)
##  $ ride_id           : chr [1:5662257] "C2F7DD78E82EC875" "A6CF8980A652D272" "BD0F91DFF741C66D" "CBB80ED419105406" ...
##  $ bike_type         : chr [1:5662257] "electric_bike" "electric_bike" "classic_bike" "classic_bike" ...
##  $ started_at        : POSIXct[1:5662257], format: "2022-01-13 11:59:47" "2022-01-10 08:41:56" ...
##  $ ended_at          : POSIXct[1:5662257], format: "2022-01-13 12:02:44" "2022-01-10 08:46:17" ...
##  $ start_station_name: chr [1:5662257] "Glenwood Ave & Touhy Ave" "Glenwood Ave & Touhy Ave" "Sheffield Ave & Fullerton Ave" "Clark St & Bryn Mawr Ave" ...
##  $ start_station_id  : chr [1:5662257] "525" "525" "TA1306000016" "KA1504000151" ...
##  $ end_station_name  : chr [1:5662257] "Clark St & Touhy Ave" "Clark St & Touhy Ave" "Greenview Ave & Fullerton Ave" "Paulina St & Montrose Ave" ...
##  $ end_station_id    : chr [1:5662257] "RP-007" "RP-007" "TA1307000001" "TA1309000021" ...
##  $ start_lat         : num [1:5662257] 42 42 41.9 42 41.9 ...
##  $ start_lng         : num [1:5662257] -87.7 -87.7 -87.7 -87.7 -87.6 ...
##  $ end_lat           : num [1:5662257] 42 42 41.9 42 41.9 ...
##  $ end_lng           : num [1:5662257] -87.7 -87.7 -87.7 -87.7 -87.6 ...
##  $ user_type         : chr [1:5662257] "casual" "casual" "member" "casual" ...
##  $ start_date        : Date[1:5662257], format: "2022-01-13" "2022-01-10" ...
##  $ end_date          : Date[1:5662257], format: "2022-01-13" "2022-01-10" ...
##  $ Start_Yr          : num [1:5662257] 2022 2022 2022 2022 2022 ...
##  $ Start_Mth         : num [1:5662257] 1 1 1 1 1 1 1 1 1 1 ...
##  $ Start_Day         : num [1:5662257] 5 2 3 3 5 3 1 7 2 6 ...
##  $ Start_Hr          : int [1:5662257] 11 8 4 0 1 18 18 12 7 15 ...
##  $ End_Yr            : num [1:5662257] 2022 2022 2022 2022 2022 ...
##  $ End_Mth           : num [1:5662257] 1 1 1 1 1 1 1 1 1 1 ...
##  $ End_Day           : num [1:5662257] 5 2 3 3 5 3 1 7 2 6 ...
##  $ End_Hr            : int [1:5662257] 12 8 4 0 1 18 18 12 8 15 ...
##  $ season            : chr [1:5662257] "Winter" "Winter" "Winter" "Winter" ...
##  $ ride_length_sec   : 'difftime' num [1:5662257] 177 261 261 896 ...
##   ..- attr(*, "units")= chr "secs"
##  $ ride_length_min   : 'difftime' num [1:5662257] 2.95 4.35 4.35 14.9333333333333 ...
##   ..- attr(*, "units")= chr "mins"
##  $ ride_length_hour  : num [1:5662257] 0.05 0.07 0.07 0.25 0.1 0.06 0.28 0.2 0.42 0.12 ...

4.3.1 Checking the missing data (Na) and removing it

summary(is.na(bike_rides))
##   ride_id        bike_type       started_at       ended_at      
##  Mode :logical   Mode :logical   Mode :logical   Mode :logical  
##  FALSE:5662257   FALSE:5662257   FALSE:5662257   FALSE:5662257  
##                                                                 
##  start_station_name start_station_id end_station_name end_station_id 
##  Mode :logical      Mode :logical    Mode :logical    Mode :logical  
##  FALSE:4829216      FALSE:4829216    FALSE:4774735    FALSE:4774735  
##  TRUE :833041       TRUE :833041     TRUE :887522     TRUE :887522   
##  start_lat       start_lng        end_lat         end_lng       
##  Mode :logical   Mode :logical   Mode :logical   Mode :logical  
##  FALSE:5662257   FALSE:5662257   FALSE:5661553   FALSE:5661553  
##                                  TRUE :704       TRUE :704      
##  user_type       start_date       end_date        Start_Yr      
##  Mode :logical   Mode :logical   Mode :logical   Mode :logical  
##  FALSE:5662257   FALSE:5662257   FALSE:5662257   FALSE:5662257  
##                                                                 
##  Start_Mth       Start_Day        Start_Hr         End_Yr       
##  Mode :logical   Mode :logical   Mode :logical   Mode :logical  
##  FALSE:5662257   FALSE:5662257   FALSE:5662257   FALSE:5662257  
##                                                                 
##   End_Mth         End_Day          End_Hr          season       
##  Mode :logical   Mode :logical   Mode :logical   Mode :logical  
##  FALSE:5662257   FALSE:5662257   FALSE:5662257   FALSE:5662257  
##                                                                 
##  ride_length_sec ride_length_min ride_length_hour
##  Mode :logical   Mode :logical   Mode :logical   
##  FALSE:5662257   FALSE:5662257   FALSE:5662257   
## 
colSums(is.na(bike_rides))
##            ride_id          bike_type         started_at           ended_at 
##                  0                  0                  0                  0 
## start_station_name   start_station_id   end_station_name     end_station_id 
##             833041             833041             887522             887522 
##          start_lat          start_lng            end_lat            end_lng 
##                  0                  0                704                704 
##          user_type         start_date           end_date           Start_Yr 
##                  0                  0                  0                  0 
##          Start_Mth          Start_Day           Start_Hr             End_Yr 
##                  0                  0                  0                  0 
##            End_Mth            End_Day             End_Hr             season 
##                  0                  0                  0                  0 
##    ride_length_sec    ride_length_min   ride_length_hour 
##                  0                  0                  0

Since start/end station name and id and ending lat/lng data is immaterial, i would proceed removing them

bike_rides <- bike_rides %>% filter(is.na(start_station_name)==F)
bike_rides <- bike_rides %>% filter(is.na(end_station_name)==F)
bike_rides <- bike_rides %>% filter(is.na(end_lat)==F)
colSums(is.na(bike_rides))
##            ride_id          bike_type         started_at           ended_at 
##                  0                  0                  0                  0 
## start_station_name   start_station_id   end_station_name     end_station_id 
##                  0                  0                  0                  0 
##          start_lat          start_lng            end_lat            end_lng 
##                  0                  0                  0                  0 
##          user_type         start_date           end_date           Start_Yr 
##                  0                  0                  0                  0 
##          Start_Mth          Start_Day           Start_Hr             End_Yr 
##                  0                  0                  0                  0 
##            End_Mth            End_Day             End_Hr             season 
##                  0                  0                  0                  0 
##    ride_length_sec    ride_length_min   ride_length_hour 
##                  0                  0                  0

Converting bike_type and user_type as factors

bike_rides$bike_type <- as.factor(bike_rides$bike_type)
bike_rides$user_type <- as.factor(bike_rides$user_type)

str(bike_rides)
## tibble [4,369,135 × 27] (S3: tbl_df/tbl/data.frame)
##  $ ride_id           : chr [1:4369135] "C2F7DD78E82EC875" "A6CF8980A652D272" "BD0F91DFF741C66D" "CBB80ED419105406" ...
##  $ bike_type         : Factor w/ 3 levels "classic_bike",..: 3 3 1 1 1 1 1 1 3 1 ...
##  $ started_at        : POSIXct[1:4369135], format: "2022-01-13 11:59:47" "2022-01-10 08:41:56" ...
##  $ ended_at          : POSIXct[1:4369135], format: "2022-01-13 12:02:44" "2022-01-10 08:46:17" ...
##  $ start_station_name: chr [1:4369135] "Glenwood Ave & Touhy Ave" "Glenwood Ave & Touhy Ave" "Sheffield Ave & Fullerton Ave" "Clark St & Bryn Mawr Ave" ...
##  $ start_station_id  : chr [1:4369135] "525" "525" "TA1306000016" "KA1504000151" ...
##  $ end_station_name  : chr [1:4369135] "Clark St & Touhy Ave" "Clark St & Touhy Ave" "Greenview Ave & Fullerton Ave" "Paulina St & Montrose Ave" ...
##  $ end_station_id    : chr [1:4369135] "RP-007" "RP-007" "TA1307000001" "TA1309000021" ...
##  $ start_lat         : num [1:4369135] 42 42 41.9 42 41.9 ...
##  $ start_lng         : num [1:4369135] -87.7 -87.7 -87.7 -87.7 -87.6 ...
##  $ end_lat           : num [1:4369135] 42 42 41.9 42 41.9 ...
##  $ end_lng           : num [1:4369135] -87.7 -87.7 -87.7 -87.7 -87.6 ...
##  $ user_type         : Factor w/ 2 levels "casual","member": 1 1 2 1 2 2 2 2 2 2 ...
##  $ start_date        : Date[1:4369135], format: "2022-01-13" "2022-01-10" ...
##  $ end_date          : Date[1:4369135], format: "2022-01-13" "2022-01-10" ...
##  $ Start_Yr          : num [1:4369135] 2022 2022 2022 2022 2022 ...
##  $ Start_Mth         : num [1:4369135] 1 1 1 1 1 1 1 1 1 1 ...
##  $ Start_Day         : num [1:4369135] 5 2 3 3 5 3 1 7 2 6 ...
##  $ Start_Hr          : int [1:4369135] 11 8 4 0 1 18 18 12 7 15 ...
##  $ End_Yr            : num [1:4369135] 2022 2022 2022 2022 2022 ...
##  $ End_Mth           : num [1:4369135] 1 1 1 1 1 1 1 1 1 1 ...
##  $ End_Day           : num [1:4369135] 5 2 3 3 5 3 1 7 2 6 ...
##  $ End_Hr            : int [1:4369135] 12 8 4 0 1 18 18 12 8 15 ...
##  $ season            : chr [1:4369135] "Winter" "Winter" "Winter" "Winter" ...
##  $ ride_length_sec   : 'difftime' num [1:4369135] 177 261 261 896 ...
##   ..- attr(*, "units")= chr "secs"
##  $ ride_length_min   : 'difftime' num [1:4369135] 2.95 4.35 4.35 14.9333333333333 ...
##   ..- attr(*, "units")= chr "mins"
##  $ ride_length_hour  : num [1:4369135] 0.05 0.07 0.07 0.25 0.1 0.06 0.28 0.2 0.42 0.12 ...

5 Analyse

Key tasks

  1. Aggregate your data so it’s useful and accessible.
  2. Organize and format your data.
  3. Perform calculations.
  4. Identify trends and relationships.

Deliverable

A summary of your analysis

Let’s dive into analyzing our data

5.1 Members vs casual riders

At this phase, we will try differentiating member vs. casual riders to look for trends and patterns

5.1.1 Descriptive analysis

Let’s start with conducting descriptive analysis

# Removing rides greater than 24 hours as it is unlikely and less than 1 mins as they could be false start
# removing ride_length_min > 24 hours and < 1 mins
bike_rides <- bike_rides %>% filter(ride_length_min >= 1, ride_length_min < 1440)

Let’s see mean, median, max and min of our ride_length_min (minutes)

bike_rides %>%
  summarise(avg_ride_length = mean(ride_length_min), median_length = median(ride_length_min),
            max_length= max(ride_length_min), min_length = min(ride_length_min))
## # A tibble: 1 × 4
##   avg_ride_length median_length max_length    min_length
##   <drtn>          <drtn>        <drtn>        <drtn>    
## 1 17.26973 mins   10.8 mins     1439.367 mins 1 mins

Let’s summarise the descriptive stats by user_type

bike_rides %>% group_by(user_type) %>% 
  summarise(avg_ride_length = mean(ride_length_min), median_length = median(ride_length_min),
                                                 max_length= max(ride_length_min), min_length = min(ride_length_min))
## # A tibble: 2 × 5
##   user_type avg_ride_length median_length max_length    min_length
##   <fct>     <drtn>          <drtn>        <drtn>        <drtn>    
## 1 casual    24.07184 mins   14.08333 mins 1439.367 mins 1 mins    
## 2 member    12.67272 mins    9.15000 mins 1436.333 mins 1 mins
# it can be seen that casual riders have almost double the average ride length compared to members and took longer rides than members

Let’s see Total number of casual and member users

bike_rides %>%
  select(user_type) %>%
  group_by(user_type) %>%
  count() %>%
  arrange()
## # A tibble: 2 × 2
## # Groups:   user_type [2]
##   user_type       n
##   <fct>       <int>
## 1 casual    1731091
## 2 member    2561462
# there are 830,371 more member riders compared to casual riders

Let’s visualize this data using a plot

ggplot(data = bike_rides) + geom_bar(mapping = aes(x = user_type, fill = user_type)) +
  labs(title = "No of users", x = "user type", y = "Count")

# as can be seen there are more members compared to casual bike riders

Let’s see count of each bike type

bike_rides %>%
  group_by(bike_type) %>% 
  summarise(count = length(ride_id))
## # A tibble: 3 × 2
##   bike_type       count
##   <fct>           <int>
## 1 classic_bike  2558850
## 2 docked_bike    173241
## 3 electric_bike 1560462

Let’s also visualize usage of different bike types by the users

ggplot(data = bike_rides) + geom_bar(mapping = aes(x = bike_type, fill = bike_type)) + facet_wrap(~user_type) +
  labs(title = "Bike type usage", x = "Bike_Type", y = "Count")

# members never used docked bikes while used mostly classic bikes
# casual riders used docked bikes alongside classic and electric bikes

5.1.2 Analysing ride_length_min (in minutes) column thoroughly

Let’s see average ride length by user type

aggregate(bike_rides$ride_length_min ~ bike_rides$user_type, FUN = mean)
##   bike_rides$user_type bike_rides$ride_length_min
## 1               casual              24.07184 mins
## 2               member              12.67272 mins
# shows that casual riders had taken twice more rides compared to members

5.1.3 Summarising ride_length_min column

Let’s summarise ride_length_min by looking at ranges of minutes to see if anything stands out

bike_rides %>%
  group_by(user_type) %>%
  summarize("<=5min" = sum(ride_length_min <=5),
            "<=15min" = sum(ride_length_min <=15),
            "<=30min" = sum(ride_length_min <=30),
            "<=45min" = sum(ride_length_min <=45),
            "<=60min" = sum(ride_length_min <=60),
            ">2hrs" = sum(ride_length_min >120),
            ">4hrs" = sum(ride_length_min >240),
            ">6hrs" = sum(ride_length_min >360),) 
## # A tibble: 2 × 9
##   user_type `<=5min` `<=15min` `<=30min` <=45m…¹ <=60m…² `>2hrs` `>4hrs` `>6hrs`
##   <fct>        <int>     <int>     <int>   <int>   <int>   <int>   <int>   <int>
## 1 casual      172062    915674   1372926 1530708 1603595   31092    5346    3068
## 2 member      555911   1884360   2394716 2521329 2545510    3629    1339     866
## # … with abbreviated variable names ¹​`<=45min`, ²​`<=60min`

5.1.4 Rides interval analysis

I have done analysis in Google sheets to see how much % of rides data lies in each ride interval

df <- read.csv("ridesanalysis.csv",check.names = F)
view(df)
kable(df[1:5, ], caption = "Rides interval analysis.")
Rides interval analysis.
user_type “<=5min” % “<=15min” % “<=30min” % “<=45min” % “<=60min” % “>2hrs” % “>4hrs” % “>6hrs” % Totals %
Totals
casual 172062 1% 915674 6% 1372926 9% 1530708 10% 1603595 10% 31092 0% 5346 0% 3068 0% 5634471 36%
member 555911 4% 1884360 12% 2394716 15% 2521329 16% 2545510 16% 3629 0% 1339 0% 866 0% 9907660 64%
Total 727973 5% 2800034 18% 3767642 24% 4052037 26% 4149105 27% 34721 0% 6685 0% 3934 0% 15542131 100%
23% cum 73% cum 100% cum
# Key takeaways
# 100% of all the rides taken are 60 mins or less
# 73% of all the rides taken are 45 mins or less
# 23% of all the rides taken are 15 mins or less
# this data will greatly assist cyclistic in assesing the strategic direction

Let’s see the user type by count and percentage

bike_rides %>%
  group_by(user_type) %>%
  summarise(count = n(), Percentage = n()/nrow(bike_rides)*100)
## # A tibble: 2 × 3
##   user_type   count Percentage
##   <fct>       <int>      <dbl>
## 1 casual    1731091       40.3
## 2 member    2561462       59.7

Let’s visualize user type by total rides and by bike type in a plot

bike_rides %>%
  group_by(user_type, bike_type) %>%
  summarise(count = n()) %>%
  ggplot(aes(x=user_type, y=count, fill=bike_type)) +
  geom_bar(stat="identity", width = 0.4) +
  labs(x="Bike Type", y="Number of Rides", title = "Total Rides by user type and bike type")
## `summarise()` has grouped output by 'user_type'. You can override using the
## `.groups` argument.

# only casual riders used docked bikes whereas members used only classic and electric bikes

5.2 Analysing rides in depth

I will analyse rides by users in sequence starting with minutes, hours, days, months and finally seasons.

Let’s create different ride length intervals alongside ride length interval column and then visualize

bike_rides <- bike_rides %>% mutate(ride_length_inter = case_when(
  ride_length_min <9.99 ~ "< 10 min",
  ride_length_min <14.99 ~ "< 15 min",
  ride_length_min >=15 & ride_length_min <=20.99 ~ "15-20 min",
  ride_length_min >=21 & ride_length_min <=30.99  ~ "21-30 min",
  ride_length_min >=31 & ride_length_min <=60.99  ~ "31-60 min",
  ride_length_min >=60 & ride_length_min <=120.99  ~ "61-120 min",
  ride_length_min >=121 & ride_length_min <=240.99  ~ "121-240 min",
  ride_length_min >=241  ~ "241+ min"))

5.2.1 Analysing rides by users in minutes

Let’s visualize the total rides taken by user type and ride length intervals

bike_rides %>%
  group_by(user_type, ride_length_inter) %>%
  summarise(count = n()) %>%
  ggplot(aes(x=factor(ride_length_inter, level = c("< 10 min", "< 15 min", "15-20 min" ,"21-30 min",   "31-60 min", "61-120 min", "121-240 min", "241+ min")), y=count, fill=user_type)) +
  geom_col(position = "dodge") +
  labs(x="Ride Length intervals", y="Number of Rides", title = "Total Rides by user type and ride length intervals")
## `summarise()` has grouped output by 'user_type'. You can override using the
## `.groups` argument.

# key takeaways
# member riders took mostly shorter rides i.e. between 1-20 min rides
# casual riders took longer rides compared to members

Let’s see average or mean of ride length for “< 10 min” interval

bike_rides %>% filter(ride_length_inter == "< 10 min") %>%
  group_by(user_type) %>%
  summarize(avg_ride_length=mean(ride_length_min))
## # A tibble: 2 × 2
##   user_type avg_ride_length
##   <fct>     <drtn>         
## 1 casual    6.344479 mins  
## 2 member    5.790387 mins

Now average or mean of “< 20min” intervals

bike_rides %>% filter(ride_length_inter == "< 15 min" | ride_length_inter== "15-20 min") %>%
  group_by(user_type) %>%
  summarize(avg_ride_length=mean(ride_length_min))
## # A tibble: 2 × 2
##   user_type avg_ride_length
##   <fct>     <drtn>         
## 1 casual    14.65101 mins  
## 2 member    14.30318 mins
# casual riders on average took longer rides under the 10 min and 15-20 min intervals

5.2.2 Analysing rides by hour of day

Let’s visualize the demand for bikes based on hour of day

bike_rides %>%
  ggplot(aes(Start_Hr, fill= user_type)) +
  labs(x="Day Hour", title="Bike demand by day hour") +
  geom_bar()

# Key takeaways
# most demand for bikes is between 7am-8am and in the evening between 4pm-6pm for both user types

Let’s visualize total number of rides by users based on day hour

bike_rides %>% 
  group_by(user_type, Start_Hr) %>% 
  summarise(count = n()) %>%  
  arrange(user_type, Start_Hr) %>% 
  ggplot(aes(x=factor(Start_Hr, level= c(6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,0,1,2,3,4,5)), 
  y=count, fill=user_type)) + geom_col(position = "dodge") +
  labs(x="Day hour", y="No of Rides", title = "Total rides by users vs. day hour")
## `summarise()` has grouped output by 'user_type'. You can override using the
## `.groups` argument.

# Key takeaways
# for both casual and member riders most number of rides are between 16pm-18pm

Let’s also visualize average ride length in mins by users based on day hour

bike_rides %>% 
  group_by(user_type, Start_Hr) %>% 
  summarise(count = n(), average_ride_length=mean(ride_length_min)) %>% 
  arrange(user_type, Start_Hr) %>% 
  ggplot(aes(x=factor(Start_Hr, level= c(6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,0,1,2,3,4,5)),
  y=average_ride_length, fill=user_type)) + geom_col(position = "dodge") +
  labs(x="Day hour", y="Avg ride length (in mins)", title = "Average rides by users and day hour")
## `summarise()` has grouped output by 'user_type'. You can override using the
## `.groups` argument.
## Don't know how to automatically pick scale for object of type <difftime>.
## Defaulting to continuous.

# Key takeaways
# casual riders took longer rides on average in mins peaking from 10am-3pm
# member riders have consistent average rides in mins throughout the day

5.2.3 Analysing rides by day time

Let’s add day time column

bike_rides <- bike_rides %>% mutate(day_time = case_when(
  Start_Hr >= 6 & Start_Hr < 9 ~ "Early Morning",
  Start_Hr >= 9 & Start_Hr < 12 ~ "Mid Morning",
  Start_Hr >= 12 & Start_Hr < 18  ~ "Afternoon",
  Start_Hr >= 18 & Start_Hr <= 23  ~ "Evening",
  Start_Hr >= 0 & Start_Hr < 3  ~ "Early Night",
  Start_Hr >= 3 & Start_Hr < 6  ~ "Late Night"))
bike_rides <- bike_rides %>% relocate(day_time, .before = season)

Let’s visualize total number of rides by users based on day time

axis_labels <- c("Early Morning \n6am-9am", "Mid Morning \n9am-12pm", "Afternoon \n12pm-6pm", "Evening \n6pm-11pm", "Early Night \n11pm-3am", "late Night \n3am-6am")
bike_rides %>% 
  group_by(user_type, day_time) %>% 
  summarise(count = n()) %>% 
  ggplot(aes(x=factor(day_time, level= c("Early Morning", "Mid Morning", "Afternoon", "Evening", "Early Night", "Late Night")),
             y=count, fill=user_type)) + 
  geom_col(position = "dodge", width = 0.4) + 
  labs(x="Day time", y="Total Rides", title = "Total Rides by user type vs. Day time") +
  scale_x_discrete(labels = axis_labels)
## `summarise()` has grouped output by 'user_type'. You can override using the
## `.groups` argument.

# Key takeaways
# members took most number of rides early morning compared to casual riders
# both users took most rides in the afternoon and evening time

Let’s now visualize average ride length in mins by users based on day time

axis_labels <- c("Early Morning \n6am-9am", "Mid Morning \n9am-12pm", "Afternoon \n12pm-6pm", "Evening \n6pm-11pm", "Early Night \n11pm-3am", "late Night \n3am-6am")
bike_rides %>% 
  group_by(user_type, day_time) %>% 
  summarise(count = n(), average_ride_length=mean(ride_length_min)) %>% 
  ggplot(aes(x=factor(day_time, level= c("Early Morning", "Mid Morning", "Afternoon", "Evening", "Early Night", "Late Night")),
             y=average_ride_length, fill=user_type)) + 
  geom_col(position = "dodge", width = 0.4) + 
  labs(x="Day time", y="Avg ride length in mins", title = "Avg Ride length in mins by users vs. Day time") +
  scale_x_discrete(labels = axis_labels)
## `summarise()` has grouped output by 'user_type'. You can override using the
## `.groups` argument.
## Don't know how to automatically pick scale for object of type <difftime>.
## Defaulting to continuous.

# Key takeaways
# casual riders on average took longer rides in minutes than members peaking in mid-morning and afternoon

5.2.4 Analysing rides by days

Let’s convert month and day data from numerical to categorical for analysis based on week day

bike_rides <- bike_rides %>%
  mutate(Start_Mth = format(as.Date(started_at), "%B")) %>%
  mutate(Start_Day = format(as.Date(started_at), "%A")) %>%
  mutate(End_Mth = format(as.Date(ended_at), "%B")) %>%
  mutate(End_Day = format(as.Date(ended_at), "%A"))
str(bike_rides)
## tibble [4,292,553 × 29] (S3: tbl_df/tbl/data.frame)
##  $ ride_id           : chr [1:4292553] "C2F7DD78E82EC875" "A6CF8980A652D272" "BD0F91DFF741C66D" "CBB80ED419105406" ...
##  $ bike_type         : Factor w/ 3 levels "classic_bike",..: 3 3 1 1 1 1 1 1 3 1 ...
##  $ started_at        : POSIXct[1:4292553], format: "2022-01-13 11:59:47" "2022-01-10 08:41:56" ...
##  $ ended_at          : POSIXct[1:4292553], format: "2022-01-13 12:02:44" "2022-01-10 08:46:17" ...
##  $ start_station_name: chr [1:4292553] "Glenwood Ave & Touhy Ave" "Glenwood Ave & Touhy Ave" "Sheffield Ave & Fullerton Ave" "Clark St & Bryn Mawr Ave" ...
##  $ start_station_id  : chr [1:4292553] "525" "525" "TA1306000016" "KA1504000151" ...
##  $ end_station_name  : chr [1:4292553] "Clark St & Touhy Ave" "Clark St & Touhy Ave" "Greenview Ave & Fullerton Ave" "Paulina St & Montrose Ave" ...
##  $ end_station_id    : chr [1:4292553] "RP-007" "RP-007" "TA1307000001" "TA1309000021" ...
##  $ start_lat         : num [1:4292553] 42 42 41.9 42 41.9 ...
##  $ start_lng         : num [1:4292553] -87.7 -87.7 -87.7 -87.7 -87.6 ...
##  $ end_lat           : num [1:4292553] 42 42 41.9 42 41.9 ...
##  $ end_lng           : num [1:4292553] -87.7 -87.7 -87.7 -87.7 -87.6 ...
##  $ user_type         : Factor w/ 2 levels "casual","member": 1 1 2 1 2 2 2 2 2 2 ...
##  $ start_date        : Date[1:4292553], format: "2022-01-13" "2022-01-10" ...
##  $ end_date          : Date[1:4292553], format: "2022-01-13" "2022-01-10" ...
##  $ Start_Yr          : num [1:4292553] 2022 2022 2022 2022 2022 ...
##  $ Start_Mth         : chr [1:4292553] "January" "January" "January" "January" ...
##  $ Start_Day         : chr [1:4292553] "Thursday" "Monday" "Tuesday" "Tuesday" ...
##  $ Start_Hr          : int [1:4292553] 11 8 4 0 1 18 18 12 7 15 ...
##  $ End_Yr            : num [1:4292553] 2022 2022 2022 2022 2022 ...
##  $ End_Mth           : chr [1:4292553] "January" "January" "January" "January" ...
##  $ End_Day           : chr [1:4292553] "Thursday" "Monday" "Tuesday" "Tuesday" ...
##  $ End_Hr            : int [1:4292553] 12 8 4 0 1 18 18 12 8 15 ...
##  $ day_time          : chr [1:4292553] "Mid Morning" "Early Morning" "Late Night" "Early Night" ...
##  $ season            : chr [1:4292553] "Winter" "Winter" "Winter" "Winter" ...
##  $ ride_length_sec   : 'difftime' num [1:4292553] 177 261 261 896 ...
##   ..- attr(*, "units")= chr "secs"
##  $ ride_length_min   : 'difftime' num [1:4292553] 2.95 4.35 4.35 14.9333333333333 ...
##   ..- attr(*, "units")= chr "mins"
##  $ ride_length_hour  : num [1:4292553] 0.05 0.07 0.07 0.25 0.1 0.06 0.28 0.2 0.42 0.12 ...
##  $ ride_length_inter : chr [1:4292553] "< 10 min" "< 10 min" "< 10 min" "< 15 min" ...

Let’s order start_day column before proceeding to further analysis

bike_rides$Start_Day <- ordered(bike_rides$Start_Day, 
                                      levels=c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"))

Let’s see the average ride length in mins during only weekdays for both users

bike_rides  %>% filter(Start_Day != "Saturday" &
                         Start_Day != "Sunday") %>% group_by(user_type, Start_Hr) %>% 
  summarise(avg_ride_length_min = mean(ride_length_min),.groups="drop") %>% 
  ggplot(aes(x = Start_Hr, y = avg_ride_length_min)) + geom_point() + 
  geom_line(aes(group = user_type, colour = user_type)) + ylim(0, NA) + 
  ggtitle("Average ride length by hour on weekdays") + 
  labs(x = "Hour", y = "Avg ride length in mins")

# Key takeaways
# casual riders against members took longer rides on average during weekdays, peaking between 10am-12am

Let’s now see average ride length by users during weekends only

bike_rides  %>% filter(Start_Day == "Saturday" |
                         Start_Day == "Sunday") %>% group_by(user_type, Start_Hr) %>% 
  summarise(avg_ride_length_min = mean(ride_length_min),.groups="drop") %>% 
  ggplot(aes(x = Start_Hr, y = avg_ride_length_min)) + geom_point() + 
  geom_line(aes(group = user_type, colour = user_type)) + ylim(0, NA) + 
  ggtitle("Average ride length by hour on weekends") + 
  labs(x = "Hour", y = "Avg ride length in mins")

# Key takeaways
# on weekends casual riders took longer rides on average in mins peaking between 10am-15pm

Let’s analyse total rides and average rides taken by users based on the week day

bike_rides %>%
  group_by(user_type, Start_Day) %>%
  summarise(number_of_rides = n()
            ,avg_ride_length_min = mean(ride_length_min),.groups="drop") %>%
  arrange(user_type, Start_Day)
## # A tibble: 14 × 4
##    user_type Start_Day number_of_rides avg_ride_length_min
##    <fct>     <ord>               <int> <drtn>             
##  1 casual    Sunday             296567 27.39110 mins      
##  2 casual    Monday             207517 24.71689 mins      
##  3 casual    Tuesday            193403 21.60804 mins      
##  4 casual    Wednesday          200505 20.77590 mins      
##  5 casual    Thursday           226552 21.40514 mins      
##  6 casual    Friday             244964 22.52028 mins      
##  7 casual    Saturday           361583 26.84669 mins      
##  8 member    Sunday             291634 14.11635 mins      
##  9 member    Monday             368249 12.22724 mins      
## 10 member    Tuesday            403843 11.98885 mins      
## 11 member    Wednesday          405145 12.05923 mins      
## 12 member    Thursday           408192 12.22946 mins      
## 13 member    Friday             353085 12.45196 mins      
## 14 member    Saturday           331314 14.26232 mins

Let’s see average rides by users based on week day side by side

aggregate(bike_rides$ride_length_min ~ bike_rides$user_type + bike_rides$Start_Day, FUN = mean)
##    bike_rides$user_type bike_rides$Start_Day bike_rides$ride_length_min
## 1                casual               Sunday              27.39110 mins
## 2                member               Sunday              14.11635 mins
## 3                casual               Monday              24.71689 mins
## 4                member               Monday              12.22724 mins
## 5                casual              Tuesday              21.60804 mins
## 6                member              Tuesday              11.98885 mins
## 7                casual            Wednesday              20.77590 mins
## 8                member            Wednesday              12.05923 mins
## 9                casual             Thursday              21.40514 mins
## 10               member             Thursday              12.22946 mins
## 11               casual               Friday              22.52028 mins
## 12               member               Friday              12.45196 mins
## 13               casual             Saturday              26.84669 mins
## 14               member             Saturday              14.26232 mins
# Key takeaways
# casual riders took more rides on the weekends where average ride length is also much higher than members
# member riders have more number of rides on the weekdays which could mean commuting to work etc.

Let’s visualize number of rides taken by users based on the day of week

bike_rides %>%
  group_by(user_type, Start_Day) %>%
  summarise(number_of_rides = n(), .groups="drop") %>%
  arrange(user_type, Start_Day)  %>%
  ggplot(aes(x = Start_Day, y = number_of_rides, fill = user_type)) +
  labs(x="Week_day", y="number_of_rides", title ="Rides by users vs. Day of the week") +
  geom_col(width=0.4, position = position_dodge(width=0.4)) +
  scale_y_continuous(labels = function(x) format(x, scientific = FALSE))

Let’s visualize the same by incorporating geom_point and line

bike_rides %>%
  group_by(user_type, Start_Day) %>%
  summarise(number_of_rides = n(), .groups="drop") %>%
  arrange(user_type, Start_Day)  %>%
  ggplot(aes(x=factor(Start_Day, level= c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")), 
             y=number_of_rides, color = user_type)) +
  geom_point() + geom_line(aes(group = user_type)) +
  labs(x="Week_day", y="number_of_rides", title ="Rides by users vs. Day of the week") +
  ylim(0, NA)

Let’s now visualize average rides taken by users based on week day

bike_rides %>%  
  group_by(user_type, Start_Day) %>% 
  summarise(average_ride_length = mean(ride_length_min), .groups="drop") %>%
  ggplot(aes(x = Start_Day, y = average_ride_length, fill = user_type)) +
  geom_col(width=0.4, position = position_dodge(width=0.4)) + 
  labs(x="Week_day", y="Avg_ride_length_mins", title ="Average ride length by users Vs. Week day")
## Don't know how to automatically pick scale for object of type <difftime>.
## Defaulting to continuous.

# Key takeaways
# casual riders on average took longer rides compared to members fluctuating between 20-25 mins
# members have consistent average rides throughout week i.e. between 10-15 min rides
# casual riders average ride length on weekends is much higher i.e. greater than 25 mins

5.2.5 Analysing rides by months

Let’s visualize total rides by users in each month

# first let's order the months data
bike_rides$Start_Mth <- ordered(bike_rides$Start_Mth, 
                                levels=c("January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December"))
bike_rides %>%  
  group_by(user_type, Start_Mth) %>% 
  summarise(number_of_rides = n(),.groups="drop") %>% 
  arrange(user_type, Start_Mth)  %>% 
  ggplot(aes(x = Start_Mth, y = number_of_rides, fill = user_type)) +
  labs(title ="Total number of rides by users Vs. Month", x = "Month", y= "Total Rides") +
  theme(axis.text.x = element_text(angle = 50)) +
  geom_col(width=0.4, position = position_dodge(width=0.4)) +
  scale_y_continuous(labels = function(x) format(x, scientific = FALSE))

# Key takeaways
# members compared to casual riders have higher number of rides throughout the year
# casual riders took most number of rides in the summer season i.e. June and July
# members took most number of rides from May till September 

Let’s visualize average ride length in mins for both users by month

bike_rides %>%  
  group_by(user_type, Start_Mth) %>% 
  summarise(average_ride_length = mean(ride_length_min),.groups="drop") %>%
  arrange(user_type, Start_Mth)  %>% 
  ggplot(aes(x = Start_Mth, y = average_ride_length, fill = user_type)) +
  labs(title ="Average ride length by users Vs. Month", x = "Month", y= "Avg ride length in mins") +
  theme(axis.text.x = element_text(angle = 50)) +
  geom_col(width=0.4, position = position_dodge(width=0.4)) +
  scale_y_continuous(labels = function(x) format(x, scientific = FALSE))

# Key takeaways
# casual riders have more number of rides on average compared to members
# casual riders took most rides on average in minutes in the months March, April and May
# members took most number of rides on average in the months from May till September

Let’s visualize Total rides by each user type in each month side by side

bike_rides %>%
  group_by(user_type, Start_Mth) %>%
  summarise(number_of_rides = n(), .groups="drop") %>%
  arrange(user_type, Start_Mth)  %>%
  ggplot(aes(x = Start_Mth, y = number_of_rides,  fill = Start_Mth)) + 
  geom_col() + facet_wrap(~user_type) + 
  labs(title = "Total Rides per Month", x = "Month", y = "Total rides") +
  theme(axis.text.x = element_text(size = 4, angle = 20))

# Key takeaways
# casual riders had most number of rides in the summer period from June-August
# members have most number of rides from May-September as our previous analysis suggested

5.2.6 Analysing rides by seasons

Finally let’s now visualize rides by users based on seasons of the year

bike_rides %>% 
  group_by(user_type, season) %>% 
  summarise(count = n()) %>% 
  ggplot(aes(x=factor(season, level= c("Spring", "Summer", "Fall", "Winter")), y=count, fill=user_type)) + 
  geom_col(position = "dodge", width = 0.5) + 
  labs(x="Season", y="Total Rides", title = "Total Rides by user type vs. season")
## `summarise()` has grouped output by 'user_type'. You can override using the
## `.groups` argument.

# Key takeaways
# members have more bike rides in total compares to casual riders in every season of the year
# casual and member riders have most rides in the summer season
# member riders used most bikes in the summer and fall season 

5.3 Analysing bike stations

Let’s now focus our analysis to bike start and end stations

# counting top 10 popular starting stations for casual riders
bike_rides %>% 
  filter(!(is.na(start_station_name))) %>% 
  filter(user_type == "casual") %>% 
  group_by(start_station_name) %>% 
  summarize(count=n()) %>% 
  arrange(-count) %>% top_n(10)
## Selecting by count
## # A tibble: 10 × 2
##    start_station_name                 count
##    <chr>                              <int>
##  1 Streeter Dr & Grand Ave            54135
##  2 DuSable Lake Shore Dr & Monroe St  29776
##  3 Millennium Park                    23568
##  4 Michigan Ave & Oak St              23408
##  5 DuSable Lake Shore Dr & North Blvd 21829
##  6 Shedd Aquarium                     19119
##  7 Theater on the Lake                17131
##  8 Wells St & Concord Ln              14652
##  9 Dusable Harbor                     13071
## 10 Clark St & Armitage Ave            12611
# counting top 10 popular starting stations for members
bike_rides %>% 
  filter(!(is.na(start_station_name))) %>% 
  filter(user_type == "member") %>% 
  group_by(start_station_name) %>% 
  summarize(count=n()) %>% 
  arrange(-count) %>% top_n(10)
## Selecting by count
## # A tibble: 10 × 2
##    start_station_name           count
##    <chr>                        <int>
##  1 Kingsbury St & Kinzie St     23134
##  2 Clark St & Elm St            20191
##  3 Wells St & Concord Ln        19364
##  4 Clinton St & Washington Blvd 18399
##  5 Loomis St & Lexington St     17827
##  6 Clinton St & Madison St      17576
##  7 University Ave & 57th St     17236
##  8 Wells St & Elm St            17217
##  9 Ellis Ave & 60th St          17173
## 10 Broadway & Barry Ave         15927

5.3.3 Analysing bike ending stations

Confirm if the ending stations for users are also the same as starting stations

bike_rides %>% 
  filter(!(is.na(end_station_name))) %>% 
  filter(user_type == "casual") %>% 
  group_by(end_station_name) %>% 
  summarize(count=n()) %>% 
  arrange(-count) %>% top_n(10) %>%
  mutate(end_station_name= fct_reorder(end_station_name, count)) %>% 
  ggplot(aes(x=end_station_name, y=count, fill=count)) +
  geom_bar(stat = "identity") + coord_flip() + scale_fill_gradient(low="blue", high="red") +
  labs(x="Ending Station Name", y="No of rides", title="Top 10 ending stations for casual riders")
## Selecting by count

bike_rides %>% 
  filter(!(is.na(end_station_name))) %>% 
  filter(user_type == "member") %>% 
  group_by(end_station_name) %>% 
  summarize(count=n()) %>% 
  arrange(-count) %>% top_n(10) %>%
  mutate(end_station_name= fct_reorder(end_station_name, count)) %>% 
  ggplot(aes(x=end_station_name, y=count, fill=count)) +
  geom_bar(stat = "identity") + coord_flip() + scale_fill_gradient(low="blue", high="red") +
  labs(x="Ending Station Name", y="No of rides", title="Top 10 ending stations for member riders")
## Selecting by count

# key takeaways
# most popular starting and ending stations for casual and member riders are the same

5.3.4 Bike start/end stations based on most round trips taken

bike_rides %>%
  group_by(start_station_name, end_station_name) %>%
  filter(start_station_name!="NULL") %>%
  summarize(count=n()) %>% ungroup %>%
  arrange(-count) %>% top_n(10)
## `summarise()` has grouped output by 'start_station_name'. You can override
## using the `.groups` argument.
## Selecting by count
## # A tibble: 10 × 3
##    start_station_name                end_station_name                  count
##    <chr>                             <chr>                             <int>
##  1 Streeter Dr & Grand Ave           Streeter Dr & Grand Ave           10862
##  2 Ellis Ave & 60th St               University Ave & 57th St           6796
##  3 DuSable Lake Shore Dr & Monroe St DuSable Lake Shore Dr & Monroe St  6705
##  4 University Ave & 57th St          Ellis Ave & 60th St                6362
##  5 Ellis Ave & 60th St               Ellis Ave & 55th St                6359
##  6 Ellis Ave & 55th St               Ellis Ave & 60th St                5716
##  7 DuSable Lake Shore Dr & Monroe St Streeter Dr & Grand Ave            5612
##  8 Michigan Ave & Oak St             Michigan Ave & Oak St              4936
##  9 Millennium Park                   Millennium Park                    3891
## 10 State St & 33rd St                Calumet Ave & 33rd St              3474
# Key takeaways
# Streeter Dr & Grand Ave has the most round trips taken by users with more than 10,000 rides
# Ellis Ave & 60th St to University Ave & 57th St has the second most round trips, could entail that students taking these rides

5.4 Map analysis

Let’s now do analysis based on map and coordinates in order to see location based rides

chicago.lines <- bike_rides %>% filter(start_lng != end_lng & start_lat != end_lat) %>%
  group_by(user_type,
           bike_type,
           start_station_id,
           start_lng,
           start_lat,
           end_lng,
           end_lat,
           start_station_name,
           end_station_name) %>%
  summarize(rides = n(),.groups="drop") %>%
  filter(rides > 100)
# creating 2 data frames based on usertype only
casuals <- chicago.lines %>% filter(user_type == "casual")
members <- chicago.lines %>% filter(user_type == "member")

# setting coordinates data and fetching stamen map for visualization
chicago <- c(left = -87.8, 
                             bottom = 41.8, 
                             right = -87.6, 
                             top = 42.1)
chicago_map <- get_stamenmap(bbox = chicago, maptype = "terrain", zoom = 12)
## ℹ Map tiles by Stamen Design, under CC BY 3.0. Data by OpenStreetMap, under ODbL.

5.4.1 Visualising ride activity in chicago map for casual riders

ggmap(chicago_map) + 
  geom_point(casuals, mapping=aes(x=start_lng,y=start_lat,color=bike_type),size=2)+
  coord_fixed(.7) + 
  xlab('')+ylab('') +
  ggtitle("Ride activity by casual riders")
## Coordinate system already present. Adding new coordinate system, which will
## replace the existing one.
## Warning: Removed 56 rows containing missing values (`geom_point()`).

## Key takeaways
# casual riders had most rides near the shore or bay area, implying that tourists are using them

5.4.2 Visualising ride activity for member riders in chicago map

ggmap(chicago_map,darken = c(0.1)) + 
  geom_point(members, mapping=aes(x=start_lng,y=start_lat,color=bike_type),size=2)+
  coord_fixed(.7) + 
  xlab('')+ylab('') +
  ggtitle("Ride activity by member riders")
## Coordinate system already present. Adding new coordinate system, which will
## replace the existing one.
## Warning: Removed 139 rows containing missing values (`geom_point()`).

## Key takeaways
# member riders are more spread out in the city, could imply that they take rides for work travel purposes

6 Share and Act

Key tasks

  1. Determine the best way to share your findings.
  2. Create effective data visualizations.
  3. Present your findings.
  4. Ensure your work is accessible.

Deliverable

Supporting visualizations and key findings

This phase showing key findings and top recommendations alongside further visualizations in dashboard is done in Tableau public

Link to Tableau public dashboard : link

6.1 Exporting the dataframe

It is time to export and store our cleaned data for further analysis in Tableau

write.csv(bike_rides, "bike_rides.csv")
# i will drop few columns that are not needed because of tableau public file size limit
drop_1 <- c("end_lat", "end_lng", "End_Month","End_Yr", "End_Day", "started_at", "ended_at", "ride_length_hour")
bike_rides_1 <- bike_rides[ , !(names(bike_rides) %in% drop_1)]
write.csv(bike_rides_1, "bike_rides_drop.csv")